Expectation Maximization for Clustering on Hyperspheres

Authors

  • Arindam Banerjee
  • Inderjit Dhillon
  • Joydeep Ghosh
  • Suvrit Sra
Abstract

High-dimensional directional data is becoming increasingly important in contemporary applications such as the analysis of text and gene-expression data. A natural model for multivariate directional data is the von Mises-Fisher (vMF) distribution on the unit hypersphere, which is analogous to the multivariate Gaussian distribution in R^d. In this paper, we propose modeling complex directional data as a mixture of vMF distributions. We derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the parameters of this mixture. We also propose two clustering algorithms corresponding to these variants. An interesting aspect of our methodology is that the spherical k-means algorithm (k-means with cosine similarity) can be shown to be a special case of both our algorithms. Thus, modeling text data by vMF distributions lends theoretical validity to the use of cosine similarity, which has been widely used by the information retrieval community. We provide several results on modeling high-dimensional text and gene data as experimental validation. The results indicate that our approach yields superior clusterings, especially for difficult clustering tasks in high-dimensional space.
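The spherical k-means algorithm mentioned in the abstract can be illustrated with a short sketch. The code below is not the authors' EM procedure for vMF mixtures; it only shows k-means with cosine similarity on unit-normalized data. The function name `spherical_kmeans`, its parameters, and the random initialization are assumptions made for illustration, and the sketch assumes that no row of the input is all zeros.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster rows of X by cosine similarity (illustrative helper, not from the paper)."""
    rng = np.random.default_rng(seed)
    # Project every point onto the unit hypersphere (assumes no all-zero rows).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialize centroids from k distinct data points (an assumed, simple choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: for unit vectors, the dot product is the cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        # Update step: each centroid becomes the renormalized sum of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                total = members.sum(axis=0)
                norm = np.linalg.norm(total)
                if norm > 0:
                    centroids[j] = total / norm
    # Final assignment against the last set of centroids.
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids
```

A call such as `labels, centroids = spherical_kmeans(X, k=3)` returns one hard cluster label per row and one unit-length mean direction per cluster. The EM variants described in the abstract additionally estimate the parameters of a vMF mixture, which this sketch does not attempt.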

Similar resources

Bayesian K-Means as a “Maximization-Expectation” Algorithm

We introduce a new class of “maximization expectation” (ME) algorithms where we maximize over hidden variables but marginalize over random parameters. This reverses the roles of expectation and maximization in the classical EM algorithm. In the context of clustering, we argue that these hard assignments open the door to very fast implementations based on data-structures such as kdtrees and cong...

On Initialization of the Expectation-Maximization Clustering Algorithm

Iterative clustering algorithms commonly do not lead to optimal cluster solutions. Partitions that are generated by these algorithms are known to be sensitive to the initial partitions that are fed as an input parameter. A “good” selection of initial partitions is an essential clustering problem. In this paper we introduce a new method for constructing the initial partitions set to be used by t...

Chemical Reaction Algorithm for Expectation Maximization Clustering

Clustering has been an intensively researched topic for some years because of its many applications in areas such as biology, information retrieval, medicine, and business. Expectation maximization (EM) is an algorithmic framework used in clustering methods, one of the ten algorithms of machine learning. Traditionally, optimization of an objective function has been the standard approach in EM. Hence, re...

Scaling-Up Model-Based Clustering Algorithm by Working on Clustering Features

In this paper, we propose EMACF (Expectation-Maximization Algorithm for Clustering Features) to generate clusters from data summaries rather than data items directly. Incorporating with an adaptive grid-based data summarization procedure, we establish a scalable clustering algorithm: gEMACF. The experimental results show that gEMACF can generate more accurate results than other scalable cluster...

Similarity based clustering using the expectation maximization algorithm

In this paper we present a new approach for clustering data. The clustering metric used is the normalized cross-correlation, also known as similarity, instead of the traditionally used Euclidean distance. The main advantage of this metric is that it depends on the signal shape rather than its amplitude. Under an assumption of an exponential probability model that has several desirable properties...

Publication date: 2003